Generable
I don’t work in infectious disease modeling; instead, I analyze:
I’ve always been interested in different modeling approaches. There is a ton of value in cross-disciplinary pollination.
I work in multiple programming languages:
This is a 3-compartment ODE model, in which a person at any one time is either:
Image source: https://mc-stan.org/users/documentation/case-studies/boarding_school_case_study.html
Each of these states is termed a “compartment”, and we model the size of each compartment at time \(t\) as the number of individuals in this state at that time.
It is described by a system of ODEs:
\[ \begin{aligned} \frac{dS}{dt} &= -\beta S \frac{I}{N}\\ \frac{dI}{dt} &= \beta S \frac{I}{N} - \gamma I \\ \frac{dR}{dt} &= \gamma I \end{aligned} \]
The number of new infections in any given day or week depends on the sizes of these compartments, and on the ratio:
\[ R_0 = \frac{\beta}{\gamma} \]
This number is called the basic reproduction number, reflecting the expected number of new infections generated by a single infected person in a fully susceptible population.
src: https://en.wikipedia.org/wiki/Measles
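The system above can also be integrated numerically by hand. Below is a minimal forward-Euler sketch in base R, using beta = 4 and gamma = 1/4 (the values used in the simulations later), so R0 = 16; this is an illustration only, while the Stan programs that follow use a proper RK45 solver.

```r
# Forward-Euler sketch of the SIR system (illustration only; the Stan
# programs in this talk use a proper RK45 integrator instead).
sir_step <- function(state, beta, gamma, dt) {
  S <- state[["S"]]; I <- state[["I"]]; R <- state[["R"]]
  N <- S + I + R
  c(S = S - beta * S * I / N * dt,
    I = I + (beta * S * I / N - gamma * I) * dt,
    R = R + gamma * I * dt)
}

beta <- 4; gamma <- 1/4
R0 <- beta / gamma              # basic reproduction number = 16
state <- c(S = 999, I = 1, R = 0)
for (step in 1:2500) {          # integrate to t = 25 with dt = 0.01
  state <- sir_step(state, beta, gamma, dt = 0.01)
}
# With R0 = 16, nearly the whole population passes through I into R.
```

With an R0 this large, the susceptible compartment is almost fully depleted by t = 25 and nearly everyone ends up in the recovered compartment.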
I’m not going to go into details regarding Bayesian inference, statistics, or Stan.
If none of these are familiar to you, treat this as a possibly motivating example. There are lots of great resources on Bayesian analysis and Stan on the internet; nothing I present here will help you that much.
Also: Not all the code has been shown here in the slides. See the source qmd file for details.
Before we get started (if you want to follow along):
install_cmdstan()
We are going to start by implementing the ODE function body in Stan code:
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real beta = theta[1];
real gamma = theta[2];
real S = y[1];
real I = y[2];
real R = y[3];
real P = S + I + R;
real dy_dt[3];
dy_dt[1] = -beta * I * S / P;
dy_dt[2] = beta * I * S / P - gamma * I;
dy_dt[3] = gamma * I;
return(dy_dt);
}
}
We can now write a minimal Stan program that will allow us to simulate different population dynamics.
We write all our methods as functions, so we can re-use them later.
We will save this in a file: sir-simulation.stan
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real beta = theta[1];
real gamma = theta[2];
real S = y[1];
real I = y[2];
real R = y[3];
real N = S + I + R;
real dy_dt[3];
dy_dt[1] = -beta * I * S / N;
dy_dt[2] = beta * I * S / N - gamma * I;
dy_dt[3] = gamma * I;
return(dy_dt);
}
real[,] simulate_sir(real[] t, data int N, data int i0, real beta, real gamma) {
int N_t = num_elements(t);
real y0[3] = {N - i0 + 0.0, i0 + 0.0, 0.0};
real t0 = -0.1; // start just before the first output time (the solver requires t0 < t[1])
real theta[2] = {beta, gamma};
real y[N_t, 3];
y = integrate_ode_rk45(sir, y0, t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
return(y);
}
}
model {
}
We can now use the rstan package to expose these as R functions.
This is a powerful tool for model interrogation; the functions are written in Stan code, transpiled to C++, and exposed to R through Rcpp.
library(rstan)
library(tidyverse)  # purrr, dplyr, tidyr, and ggplot2 are used below
expose_stan_functions('model/sir-simulation.stan')
t <- seq(from = 0, to = 25, by = 0.1)
sim_df <- simulate_sir(t = t,
N = 1000,
i0 = 1,
beta = 4,
gamma = 1/4
) %>%
map(set_names, c('S', 'I', 'R')) %>%
map_dfr(as_tibble_row, .id = 't_index') %>%
mutate(t_index = as.integer(t_index)) %>%
left_join(tibble(t = t) %>% mutate(t_index = row_number()))
sim_df %>%
tidyr::gather(state, N, S, I, R) %>%
ggplot(aes(x = t, y = N, colour = state)) +
geom_line(linewidth = 2) +
ggtitle('Simulated Population Dynamics') +
theme(text = element_text(size = 15))
Of course, observed data are rarely this clean.
Let’s assume we observe daily case counts, and that we use a negative binomial observation model for them.
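As a reminder of the parameterization: Stan’s neg_binomial_2(mu, phi) has mean mu and variance mu + mu^2 / phi, which corresponds to R’s rnbinom(mu = mu, size = phi). A quick sanity check (the values of mu and phi here are illustrative only, not taken from the model):

```r
# Sanity check: Stan's neg_binomial_2(mu, phi) matches R's rnbinom(mu, size = phi),
# with mean mu and variance mu + mu^2 / phi (overdispersed relative to Poisson).
set.seed(1)
mu  <- 100
phi <- 10
draws <- rnbinom(1e5, mu = mu, size = phi)
mean(draws)   # approximately mu
var(draws)    # approximately mu + mu^2 / phi = 1100
```

Smaller phi means heavier overdispersion; note that the fitting program below parameterizes this as phi_inv = 1 / phi, with an exponential prior on phi_inv.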
We add a new function to our data simulation for the observed data:
Now, we can update our simulation to include both the assumed “true” population incidence and the observed counts:
expose_stan_functions('model/sir-simulation.stan')
case_t <- seq(from = 0, to = max(t), by = 1)
sim_cases <- simulate_cases_rng(t = case_t,
N = 1000,
i0 = 1,
beta = 4,
gamma = 1/4,
phi = 10
) %>%
tibble(cases = .) %>%
bind_cols(t = case_t)
Again we use rstan::expose_stan_functions to integrate this ODE system and sample daily counts.
Next, we evolve our Stan program to include estimating the likelihood (along with the priors) given data.
This will allow us to fit the model to our simulated data.
functions {
real[] sir(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real beta = theta[1];
real gamma = theta[2];
real S = y[1];
real I = y[2];
real R = y[3];
real N = S + I + R;
real dy_dt[3];
dy_dt[1] = -beta * I * S / N;
dy_dt[2] = beta * I * S / N - gamma * I;
dy_dt[3] = gamma * I;
return(dy_dt);
}
real[,] simulate_sir(real[] t, data int N, data int i0, real beta, real gamma) {
int N_t = num_elements(t);
real y0[3] = {N - i0 + 0.0, i0 + 0.0, 0.0};
real t0 = -0.1;
real theta[2] = {beta, gamma};
real y[N_t, 3];
y = integrate_ode_rk45(sir, y0, t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
return(y);
}
int[] simulate_cases_rng(real[] t, data int N, data int i0, real beta, real gamma, real phi) {
int N_t = num_elements(t);
real y[N_t, 3] = simulate_sir(t, N, i0, beta, gamma);
int cases[N_t] = neg_binomial_2_rng(y[,2], phi);
return(cases);
}
}
data {
int N_t;
real t[N_t];
int cases[N_t];
int N;
int i0;
}
parameters {
real<lower = 0> beta;
real<lower = 0> gamma;
real<lower = 0> phi_inv;
}
transformed parameters {
real phi = 1. / phi_inv;
real ys[N_t, 3] = simulate_sir(t, N, i0, beta, gamma);
}
model {
// priors
phi_inv ~ exponential(5);
beta ~ normal(0, 5);
gamma ~ normal(0.4, 0.5);
// likelihood
cases ~ neg_binomial_2(ys[,2], phi);
}
generated quantities {
int yrep[N_t] = neg_binomial_2_rng(ys[,2], phi);
real log_lik[N_t];
for (i in 1:N_t)
log_lik[i] = neg_binomial_2_lpmf(cases[i] | ys[i,2], phi);
}
Next, we use tidybayes to summarize inferences for key parameters. Formatting posterior estimates takes some care since we need to join indices back with the original data.
library(tidybayes)
fit_draws <- sir_fit %>%
gather_draws(yrep[i], ys[i, state_index]) %>%
ungroup() %>%
left_join(sim_cases %>% mutate(i = row_number())) %>%
left_join(tibble(state = c('S', 'I', 'R')) %>%
mutate(state_index = row_number())) %>%
mutate(state = if_else(is.na(state), .variable, state)) %>%
select(state, .value, .chain, .iteration, .draw, cases, t) %>%
spread(state, .value)
Once formatted, we can use tidybayes to summarize flexibly:
Now that we have a base model, let’s test it on some real data.
There is a canonical dataset describing a particularly virulent influenza outbreak at a boarding school in 1978. The strain was later determined to be H1N1, though this was not known during the outbreak.
These data are available via the outbreaks library.
We are given the following in the data description:
The index case was infected by 1978-01-10, and had febrile illness from 1978-01-15 to 1978-01-18. 512 boys out of 763 became ill.
The data include daily counts of children both in bed, and who are convalescing.
It’s worth noting that the earliest date in our data is 1978-01-22.
These data are missing details of the index case. We will hard-code this information as our initial conditions.
We will also ignore the “convalescent” category, for now, and lump them in with the “recovered” group (neither infectious nor susceptible).
# prepare data
flu_N <- 763
date0 <- min(influenza_england_1978_school$date)
flu_t <- as.integer(influenza_england_1978_school$date - date0)
flu_cases <- influenza_england_1978_school$in_bed
flu_data <- list(N = flu_N,
i0 = 1,
N_t = length(flu_t),
t = flu_t,
cases = flu_cases)
# sample model
flu_fit <- basic_sir$sample(data = flu_data,
seed = params$seed, refresh = 0)
As before, we format the posterior estimates & summarize the expected dynamics.
library(tidybayes)
fit_draws <- flu_fit %>%
gather_draws(yrep[i], ys[i, state_index]) %>%
ungroup() %>%
left_join(tibble(t = flu_t, cases = flu_cases) %>% mutate(i = row_number())) %>%
left_join(tibble(state = c('S', 'I', 'R')) %>%
mutate(state_index = row_number())) %>%
mutate(state = if_else(is.na(state), .variable, state)) %>%
select(state, .value, .chain, .iteration, .draw, cases, t) %>%
spread(state, .value)
fit_draws %>%
gather(state, .value, S, I, R) %>%
ggplot(aes(x = t, y = .value, group = state, fill = state, colour = state)) +
stat_lineribbon(.width = c(0.8), alpha = 0.4) +
geom_point(aes(y = cases,
colour = NULL,
fill = NULL)) +
scale_y_continuous('N') +
scale_x_continuous('time') +
ggtitle('Posterior estimates of population dynamics for flu outbreak') +
labs(caption = str_c('Ribbons show posterior median and 80% CrI',
'Points reflect observed values',
sep = "\n")
)
In addition, we compare the distribution of expected counts against that of the observed data.
fit_draws %>%
ggplot(aes(x = yrep, colour = 'predicted', fill = 'predicted')) +
stat_halfeye(alpha = 0.3) +
stat_histinterval(aes(x = cases, colour = 'observed', fill = 'observed'), data = tibble(cases = flu_cases),
alpha = 0.2) +
ggtitle('Observed vs Predicted Case Counts for Flu Outbreak') +
scale_fill_discrete('Type') +
scale_colour_discrete('Type') +
scale_y_continuous('Density') +
scale_x_continuous('Value')
There is a higher incidence of counts in the 250-case range than the model expects. Still, it’s possible that our model is useful, though imperfect.
The assumptions of the basic SIR model do not apply to most pandemics.
Typically, it is a starting point. Let’s consider COVID-19 as an example.
Here are some example case counts from Switzerland:
covid_N <- 5e+03
df_swiss <- df_swiss %>%
filter(report_dt > 0)
date0 <- min(df_swiss$date)
covid_t <- as.integer(df_swiss$date - date0)
covid_cases <- df_swiss$report_dt
fit_file <- here::here('fits', 'covid-basic_sir_fit3.Rds')
if (!fs::file_exists(fit_file)) {
covid_data <- list(N = covid_N,
i0 = 1,
N_t = length(covid_t),
t = covid_t,
cases = covid_cases)
covid_fit <- basic_sir$sample(data = covid_data, seed = params$seed, refresh = 0)
covid_fit$save_object(fit_file)
} else {
covid_fit <- readRDS(fit_file)
}
Formatting the posterior estimates again takes some care.
library(tidybayes)
fit_draws <- covid_fit %>%
gather_draws(yrep[i], ys[i, state_index]) %>%
ungroup() %>%
left_join(tibble(t = covid_t, cases = covid_cases) %>% mutate(i = row_number())) %>%
left_join(tibble(state = c('S', 'I', 'R')) %>%
mutate(state_index = row_number())) %>%
mutate(state = if_else(is.na(state), .variable, state)) %>%
select(state, .value, .chain, .iteration, .draw, cases, t) %>%
spread(state, .value)
In this fit, we provide an assumed value for N, the total population size. However, the number of effectively susceptible individuals is not actually known.
We also see a pretty marked discrepancy between the observed counts and the model estimates.
Both suggest we should revise the model, perhaps to estimate N.
Now, we modify our model to estimate N rather than use an assumed value.
Here is the Stan program for the revised model.
functions {
real[] sir2(real t, real[] y, real[] theta,
real[] x_r, int[] x_i) {
real beta = theta[1];
real gamma = theta[2];
real initS = theta[3];
real initI = theta[4];
real S = y[1] + initS;
real I = y[2] + initI;
real R = y[3];
real N = S + I + R;
real dy_dt[3];
dy_dt[1] = -beta * I * S / N;
dy_dt[2] = beta * I * S / N - gamma * I;
dy_dt[3] = gamma * I;
return(dy_dt);
}
real[,] simulate_sir2(real[] t, real initS, real initI, real beta, real gamma) {
int N_t = num_elements(t);
real t0 = -0.1;
real theta[4] = {beta, gamma, initS, initI};
real y[N_t, 3];
y = integrate_ode_rk45(sir2, rep_array(0.0, 3), t0, t, theta, rep_array(0.0, 0), rep_array(0, 0));
return(y);
}
int[] simulate_cases2_rng(real[] t, real initS, real initI, real beta, real gamma, real phi) {
int N_t = num_elements(t);
real y[N_t, 3] = simulate_sir2(t, initS, initI, beta, gamma);
int cases[N_t] = neg_binomial_2_rng(y[,2], phi);
return(cases);
}
}
data {
int N_t;
real t[N_t];
int cases[N_t];
}
parameters {
real<lower = 0> beta;
real<lower = 0> gamma;
real<lower = 0> phi_inv;
real log_initS_raw;
real<lower = 0> initI;
}
transformed parameters {
real phi = 1. / phi_inv;
real<lower = 0> initS = 1000 + exp(log_initS_raw);
real ys[N_t, 3] = simulate_sir2(t, initS, initI, beta, gamma);
}
model {
// priors
phi_inv ~ exponential(5);
beta ~ normal(0, 5);
gamma ~ normal(0.4, 0.5);
log_initS_raw ~ normal(5, 10);
initI ~ std_normal();
// likelihood
cases ~ neg_binomial_2(ys[,2], phi);
}
generated quantities {
int yrep[N_t] = neg_binomial_2_rng(ys[,2], phi);
real log_lik[N_t];
for (i in 1:N_t)
log_lik[i] = neg_binomial_2_lpmf(cases[i] | ys[i,2], phi);
}
Next, we format the posterior estimates.
library(tidybayes)
fit_draws2 <- covid_fit2 %>%
gather_draws(yrep[i], ys[i, state_index]) %>%
left_join(covid_fit2 %>% spread_draws(initS, initI)) %>%
ungroup() %>%
left_join(tibble(t = covid_t, cases = covid_cases) %>% mutate(i = row_number())) %>%
left_join(tibble(state = c('S', 'I', 'R')) %>%
mutate(state_index = row_number())) %>%
mutate(state = if_else(is.na(state), .variable, state)) %>%
select(state, .value, .chain, .iteration, .draw, cases, t, initS, initI) %>%
spread(state, .value) %>%
mutate(S = S + initS,
I = I + initI)
In practice, the SIR model is a starting point.
There are any number of extensions that have been used, not only for COVID-19 but for other outbreaks.
As an example, consider this 8-compartment model used to describe COVID-19 disease dynamics.
The observed data include daily counts of cases, hospitalizations, and deaths. Each is modeled with a negative binomial distribution given the estimated true prevalence.
src: Keller, JP, Zhou, T, Kaplan, A, Anderson, GB, Zhou, W. Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model. Statistics in Medicine. 2022; 41(15): 2745-2767. doi:10.1002/sim.9382
In addition, their model includes particular covariate effects on time-varying infectiousness parameters.
For example, one informative covariate was the degree of mobility (from aggregate cell-phone tracking data)
Most of the content for this presentation is based on the following case study:
In addition, I draw on and am motivated by the work in this paper:
Full code available at: